Parallel Evaluation of Functional Programs: The ⟨ν, G⟩-Machine Approach

Author

  • Thomas Johnsson
Abstract

For a number of years, this author together with Lennart Augustsson has been developing fast implementations of lazy functional languages, based on graph reduction, for ordinary (sequential) computers. Our approach to sequential implementation can be summarised very briefly as follows. It stems from Turner's S, K, I standard combinator reduction approach [Tur79], but instead of using a standard, fixed set of combinators, the compiler transforms the program into a new set of specialised combinators, or `super-combinators' [Hug82]. This transformation process is called lambda lifting [Joh85]. Each of these super-combinators is then compiled into machine code for the machine at hand; this code implements the graph rewrite rule that the combinator implies. Put another way, the compiler constructs a specialised, machine-language coded combinator interpreter from each program. However, rather than compiling each combinator into machine code directly, we first compile it into code for an abstract machine, the G-machine [Joh84]. Also, rather than letting the code rewrite the graph for a combinator application into the graph of the right-hand side of the combinator definition, quite a lot of improvements to this scheme are possible; the G-machine is a convenient abstraction for expressing these improved compilation schemes. An overview of the techniques used in our compiler for Lazy ML can be found in [Aug84] and [AJ89b]. The compilation of pattern matching into efficient code is described in [Aug85]. Our method of lambda lifting is described in [Joh85, Joh87], and the G-machine is described in [Joh84]. The approach to machine code generation used in the LML compiler is described in [Joh86].

Parallel computers, consisting of many (dozens, hundreds, or thousands of) processors connected to either a shared memory or a message passing network, are now becoming available on the marketplace.
Recently, we have done work on extending the G-machine techniques to perform parallel graph reduction on such computers [AJ89a]. It is possible to modify the sequential G-machine into a parallel one straightforwardly, by having multiple threads of control (of course), all of which perform graph reduction in a common graph: this is the shared memory model. Further, in this straightforward extension there would be a pair of stacks, or perhaps only a single stack, for each thread of control. Such systems have been designed and implemented by Maranget [Mar91] and [Geo89], apparently with good performance. This is also the approach taken in the GRIP project [Jon87, JCS89]. However, there are some properties of the standard G-machine that made us want to try a different approach for a parallel implementation. Firstly, in the G-machine, when reduction of a function application starts, the arguments of the application node (either in the form of a chain of binary application nodes, or a vector application node) are moved to the stack, and when reduction is finished the result is moved back into the heap by updating the root application node with the value of the function application. This seems like a lot of unnecessary data movement when the datum could have been accessed from the node in the first place (this argument has nothing to do with parallelism, of course). Secondly, the prospect of having to manage a cactus stack was not very appealing; we yearned for something simpler. In the machines we would like to consider, and in particular the machine we have implemented our parallel graph reducer on, the Sequent Symmetry™, a memory reference into the heap has the same cost as a reference into a stack, since they reside in the same (shared) memory. Thus the cost of moving a word to the heap while building a node is the same as pushing a word onto the stack.
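The data movement criticised above can be made concrete with a deliberately minimal, hypothetical model of a standard G-machine reduction step: the arguments are first copied from heap application nodes onto a stack, and the result is then copied back into the heap by updating the root node in place. This is a sketch only; a real G-machine spine, tags, and update mechanism are considerably richer:

```c
#include <assert.h>

/* Minimal heap node: either an application or a number. */
enum tag { APP, NUM };

struct node {
    enum tag tag;
    struct node *fun, *arg;   /* valid when tag == APP */
    long num;                 /* valid when tag == NUM */
};

static struct node *stack[64];
static int sp = 0;

/* Reduce  (add x) y : unwind the spine by pushing both arguments
 * onto the stack, compute, then update the root application node
 * in place with the result. */
static void reduce_add(struct node *root) {
    stack[sp++] = root->arg;        /* y, from the outer APP node */
    stack[sp++] = root->fun->arg;   /* x, from the inner APP node */
    long x = stack[--sp]->num;      /* moved to stack, then read  */
    long y = stack[--sp]->num;
    root->tag = NUM;                /* overwrite root with result */
    root->num = x + y;
}
```

In the ⟨ν, G⟩-machine described next, the arguments instead stay in the frame node and are addressed there directly, so the copy to and from a separate stack disappears.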
Thus, in the abstract machine we have designed, called the ⟨ν, G⟩-machine, function applications are represented by frame nodes, which hold the arguments of the function application and a pointer to the code for the function being applied, but in addition also contain enough space for the temporaries needed for reduction of the function application. Figure 1 shows what happens when EVAL is called: the `current point of reduction' is moved to the frame node to be evaluated, and a `dynamic link' field is set to point back to the frame which did the eval. Thus, instead of an ordinary stack we have a linked list of stack frames. In the parallel case, we have many points of reduction. For further details of the abstract machine, see [AJ89a].

[Figure 1: Calling EVAL to reduce a frame node.]

To be fair, the stack model has some advantages too. The spineless G-machine [BRJ88], which offers a more general and efficient tail call mechanism than the `standard G-machine', in particular when dealing with higher order functions, requires an essentially arbitrarily big stack. Lester [Les89] has devised an analysis technique based on abstract interpretation to determine the maximum size a stack might reach under the `spineless' evaluation regime. Thus it would be possible to merge the ⟨ν, G⟩-machine model with the `spineless' model of execution, by allocating a frame node of the required maximum size. Both the stack model and the frame node model have their advantages, and it is too early to nominate an overall winner.

So far, to introduce parallelism into LML programs the programmer has to write spark annotations [CJ86] in the programs explicitly. The spark annotation is advisory: if there is a processor available then it may evaluate the sparked expression; otherwise the process actually needing the value will evaluate it itself. Code generation now works rather differently from the way it was described in [AJ89a].
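The frame node and the EVAL mechanism described above can be sketched as follows. The layout, field names, and fixed sizes are hypothetical and only meant to make the description concrete; the real machine sizes frames per combinator:

```c
#include <assert.h>

/* A frame node: arguments of the application, a pointer to the
 * code of the function being applied, scratch space for
 * temporaries, and the dynamic link that chains frames into a
 * linked list replacing the conventional stack. */
struct frame {
    void (*code)(struct frame *self);  /* code of applied function */
    struct frame *dyn_link;            /* frame that did the eval  */
    long args[3];                      /* application arguments    */
    long temps[4];                     /* temporaries for reduction*/
    long value;                        /* result, once canonical   */
};

/* Example code for a combinator that sums its three arguments. */
static void sum3_code(struct frame *f) {
    f->value = f->args[0] + f->args[1] + f->args[2];
}

/* EVAL: move the current point of reduction to 'callee', set its
 * dynamic link back to the caller, run its code, and return along
 * the dynamic link afterwards. */
static struct frame *eval_frame(struct frame *caller,
                                struct frame *callee) {
    callee->dyn_link = caller;
    callee->code(callee);
    return callee->dyn_link;
}
```

Because each point of reduction is just a pointer into this linked structure, having many of them (the parallel case) needs no per-thread contiguous stack.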
Code is generated by first translating the combinators into three-address form, with liberal use of temporary names. We illustrate this with the code for the combinator f x y z = x y, which is:

    funstart f 3      start of f, which takes 3 args
    load t0,0(nu)     load z from frame into t0
    load t1,1(nu)     load y from frame into t1
    load t2,2(nu)     load x from frame into t2
    move t1,t4
    move t4,t5
    store t5,0(nu)    store the arg y of the application into the frame
    eval t2           evaluate x, the function
    move t2,t3
    do t3,1           tail call, function is x, one arg in frame

The code for a combinator starts by loading all arguments from the frame node into temporaries. Then, in the example above, the argument y of the tail call is moved into the current frame at the location of the last argument; the function x of the tail call is evaluated into function form, and finally the tail call is performed with the general tail call instruction do. This `raw' code is then subjected to various improvement transformations; for instance, the loads and stores are moved around to minimise the number of live variables across eval. Finally, temporaries are bound to machine registers. The result looks like this:

    funstart f 3      start of f, which takes 3 args
    load r0,2(nu)     load x from frame into register r0
    eval r0           evaluate x, the function, in r0
    load r4,1(nu)     load y from frame into register r4 ...
    store r4,0(nu)    ... and store y, the arg of the application, into the frame
    do r0,1           tail call, function is x, one arg in frame

From this code the actual machine code is generated. A notable feature of the generated code is that we have abandoned the method of coding the tag as a pointer, either to a table (as described in [Joh86]) or to code directly, as in the spineless tagless G-machine [?]. Instead, the tag word contains various tag bits. The reason comes from two observations: firstly, most of the time when doing eval the node is already canonical; according to measurements, 80% of the time is typical.
Secondly, in most modern architectures with instruction prefetch, it is rather costly to break the sequential flow of control. Therefore eval is implemented with code that tests a canonical-bit in the tag field of the node to be evaluated; if canonical, the next instruction is executed, and only if it is not canonical does a jump occur to code that performs the actual call to the eval routine. The call to eval is surrounded by code that stores and reloads the contents of live registers.

Our implementation of the parallel ⟨ν, G⟩-machine is for the Sequent Symmetry, a bus-based shared memory machine. The architecture supports up to 30 processors connected to the bus; our machine has 16 processors. This machine has some features that help very much in the implementation of the parallel ⟨ν, G⟩-machine; for instance, any cell in the memory can be used as an atomic lock. At the time of writing, a new garbage collector is being tested [Röj91]. The Appel-Ellis-Li garbage collector is an efficient real-time copying garbage collector which runs concurrently with the mutator processes. Röjemo has extended it to also collect processes which have become garbage.

Since the publication of [AJ89a] we have improved the performance somewhat due to the improved code generation method, as described briefly above. The improvement is about 25% for purely sequential code, but for parallel programs the improvement is less than that, depending on how big a proportion of the time is spent in activities like synchronisation, task switching, etc. Figure 2 shows the current speedup charts for three benchmark programs. Garbage collection time is not included here.
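The inlined eval test described above can be sketched like this. The tag word carries a canonical bit that is checked in line, so in the common case (about 80% of evals, per the measurements above) control falls straight through, and the expensive out-of-line call happens only for unevaluated nodes. The bit assignment and routine names are hypothetical, and the slow path is reduced to a stub:

```c
#include <stdint.h>
#include <assert.h>

#define TAG_CANONICAL 0x1u   /* hypothetical bit assignment */

struct node {
    uint32_t tag;
    long value;              /* stands in for the node's contents */
};

/* Out-of-line eval routine: spill live registers, reduce the node
 * to canonical form (stubbed out here), then mark it canonical. */
static void eval_slow(struct node *n) {
    /* ... actual graph reduction would happen here ... */
    n->tag |= TAG_CANONICAL;
}

/* The test emitted inline at each eval site: only a failed bit
 * test breaks the sequential flow of control. */
static inline void eval_node(struct node *n) {
    if (!(n->tag & TAG_CANONICAL))   /* usually false: fall through */
        eval_slow(n);
}
```

The point is that the prefetch-friendly fall-through path costs one test-and-branch, instead of an unconditional indirect jump through the tag as in tag-as-code-pointer schemes.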
[Figure 2: Speedup graphs for three benchmark programs (nfib, euler, 10q), speedup plotted against number of processors (1-15). The left graph shows the speedup relative to one processor; the right graph shows the speedup relative to the `standard G-machine' in the LML compiler.]

Publication date: 1991